Using Proper Names to Cluster Documents
Authors
Abstract
Proper names occur frequently in all types of natural language text, yet their treatment remains under-researched in Natural Language Processing. One particular problem is how to link information about the same entity when it is referred to by possibly different proper names across several documents. In this paper we describe a prototype system that first pre-processes individual documents using a simple name-conflation algorithm and then uses an adaptation of Schütze's context-group discrimination algorithm to cluster documents that are judged to contain references to the same named entity. We use this system to assess the potential utility of different contextual...
Similar resources
Clinical Document Clustering using Multi-view Non-Negative Matrix Factorization
Clinical documents contain vital information such as symptom names, medication names, age, gender and other demographic information. This information can be used to give quick relief from a disease. An existing system was built for clustering symptom names and medication names using Multi-View Non-Negative Matrix Factorization. While considering the clinical documents the facto...
Identification of related multilingual documents using ant clustering algorithms / Identificación de documentos multilingües relacionados mediante algoritmos de clustering de hormigas
This paper presents a document representation strategy and a bio-inspired algorithm to cluster multilingual collections of documents in the field of economics and business. The proposed approach allows the user to identify groups of related economics documents written in Spanish and English using techniques inspired by the clustering and sorting behaviours observed in some types of ants. In order t...
Proper name retrieval from diachronic documents for automatic speech transcription using lexical and temporal context
Proper names are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary coverage of a speech transcription system by automatically retrieving new proper names from contemporary diachronic text documents. The idea is to use in-vocabulary proper names as an anchor to collect new linked proper names from the diachronic corpus. Our assump...
Textual Similarity based on Proper Names
Proper names represent about 10% of English or French newspaper articles. Their quantity and informational quality are already used in different Information Extraction systems. Proper names have been widely studied in the MUC conferences designed to promote research in Information Extraction. We have created our own named entity extraction tool based on a linguistic description with automata. Th...
Ajout de nouveaux noms propres au vocabulaire d'un système de transcription en utilisant un corpus diachronique
Proper names are usually key to understanding the information contained in a document. Our work focuses on increasing the vocabulary size of a speech transcription system by automatically retrieving proper names from a contemporary diachronic text corpus. We assume that some proper names appear in documents relating to the same time period and in similar lexical contexts. We proposed methods that d...
Journal title:
Volume / Issue
Pages -
Publication date: 2002